feat: added example FastAPI-based inference server for Qwen-ASR#31
Open
kyr0 wants to merge 1 commit into QwenLM:main from
Conversation
AI-generated garbage

Author:
@RomiVu Have you even tried it? You have 2 contributions this year and you react like this to a working solution that has already gathered a few stars? I really wonder how bitter you must feel. https://github.com/kyr0/fast-qwen-asr-inference-vllm
Addressing #15 and a few other questions, I've implemented, tested, and thoroughly benchmarked Qwen-ASR on a 1x NVIDIA H200 NVL, and came up with this inference server implementation, which is both simple and fully featured. It might serve as a boilerplate for more sophisticated implementations -- I believe it hits a good sweet spot right now: it scales well under load and is configurable, yet still easy to understand. I've also implemented readiness probes and simple monitoring/SRE features. The Forced Aligner is supported as well; every feature documented in the examples folder should be easy to address with this. Also, server.py is volume-mounted, so you don't need to rebuild the container on app code changes -- another DX improvement. The container also caches the model in the host's HF_HOME. Last but not least, I've provided local and remote reference audios that were generated with ... Qwen-TTS :)
I hope this will reduce the number of issues opened due to confusion.
Requirements: the NVIDIA Container Toolkit must be installed on the host (!).
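As a sketch of how one might fail fast when that requirement is missing, here is a small, hypothetical preflight check that probes for nvidia-smi before the server starts. This helper is not part of the PR; it only illustrates the idea.

```python
# Hypothetical preflight check: verify the host exposes NVIDIA GPU tooling
# before attempting to start the inference server. Not part of the PR itself.
import shutil
import subprocess


def gpu_available() -> bool:
    """Return True if nvidia-smi is on PATH and exits successfully."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return False
    # nvidia-smi exits non-zero when the driver/toolkit is broken.
    return subprocess.run([exe], capture_output=True).returncode == 0


if __name__ == "__main__":
    if not gpu_available():
        raise SystemExit(
            "No working NVIDIA GPU tooling found -- is the "
            "NVIDIA Container Toolkit installed on the host?"
        )
```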